ABSTRACT

This paper discusses a linear classifier based on linear programming
which is adaptive to a change in the set of input vectors. Unlike linear
classifiers discussed in the past, this linear classifier maintains the
maximum reliability of its operation. The procedure for deriving an
optimum structure of the linear classifier for a new set of input vectors
is a modification of the ordinary simplex method and yields an optimum
structure in far fewer iterations than determining the structure by
straightforward application of the ordinary simplex method.

The adaptive procedure is then extended to other cases, such as a linear
classifier which maintains the minimum number of erroneously classified
input vectors. This extension is based on Gomory's algorithm for integer
linear programming.

The feasibility and efficiency of our linear classifiers are demonstrated
computationally by some examples.

Index  Terms 

Adaptive  network 

Simplex  method 

Linear  programming 

Integer  linear  programming 

Gomory's  algorithm

Dual  linear  programming 

Linear  classifier 

Reliability  of  adaptive  network 

Input  tolerance  of  adaptive  network 


1.   Introduction 

Linear classifiers for pattern recognition have received much attention
recently, both for their cognitive power with a simple structure and as
constituents of more complex classifiers. There are at least two types of
linear classifiers, distinguished by their design: one designed by
adaptive (learning) methods and the other designed by linear programming.
In this paper we discuss a linear classifier which is based on linear
programming but which is augmented with adaptive capability.

An adaptive method was first discussed in conjunction with the learning
capacity of neural networks [2]. It has some merits, such as (1) simplicity
of the training rule, (2) economy of memory space and (3) flexibility; in
other words, it can adjust itself easily to a change of environment (the
pattern vectors being processed) by constantly applying the training
procedure. If the given pattern vectors are not linearly separable,
however, the successive changes in the structure of an adaptive classifier
are not simple. And even if the patterns are separable, the concept of
optimality is not incorporated in the adaptive methods.

Linear programming methods, on the other hand, guarantee the optimality of
a given linear function of the network parameters, such as the sum of all
weights. The paper by Smith shows that the computation time for adaptive
methods is less than that for the linear programming methods when the
number of pattern vectors is much greater than the dimension of the
vectors, and that the linear programming methods are preferable otherwise.
However, all these papers deal only with the static case, in which only a
single set of pattern vectors is given, and do not deal with the general
dynamic case, in which new pattern vectors replace some of the current
vectors. Thus the linear classifiers discussed in them cannot adapt to a
changing environment.

Our objective in this paper is to show a linear programming method which
has the capability of adaptation even in a changing environment. In other
words, our linear classifier can adjust itself to a changing set of pattern
vectors, always keeping the optimum structure of the classifier and without
increasing the size of the storage space used. This paper also discusses a
linear classifier based on an integer linear programming method which
minimizes the number of error vectors even if the current set of pattern
vectors is not linearly separable, and which also adapts to a changing set
of pattern vectors.


2.      Problem  Statement 

We consider a linear classifier which separates $M$ $N$-dimensional
pattern vectors

$$\xi^{(j)} = (\xi_1^{(j)}, \ldots, \xi_N^{(j)}), \qquad j = 1, 2, \ldots, M \tag{2.1}$$

into two categories A and B, for which the output $f$ of the classifier is
assumed to be 1 and 0, respectively. The coordinates $\xi_i^{(j)}$ are real
numbers. An objective in designing a linear classifier is the
determination of a weight vector

$$w = (w_1, \ldots, w_N) \tag{2.2}$$

and a constant value

$$w_0 \tag{2.3}$$

of real numbers which satisfy

$$\begin{aligned}
w \cdot \xi^{(j)} + w_0 &\ge d && \text{if } f = 1,\\
w \cdot \xi^{(j)} + w_0 &\le -d && \text{if } f = 0,
\end{aligned} \qquad j = 1, \ldots, M, \tag{2.4}$$

where $w \cdot \xi^{(j)}$ is the inner product $\sum_{i=1}^{N} w_i \xi_i^{(j)}$
and $d > 0$ is called the margin of the classifier*. The margin is provided
to secure reliable operation even when a small deviation in
$w \cdot \xi^{(j)}$ is caused by noise.

* Even if the value of the left-hand side of (2.4) does not exceed $d$
(does not reach $-d$) for a vector of category A (B), the classifier is
supposed to work properly as long as the value is positive (negative).
This is why $d$ is called the "margin". $d$ is usually set to 1 without
loss of generality.
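As a small illustration of condition (2.4), the following Python sketch checks whether a given weight vector meets the margin for every pattern. The function names and the two-input data are assumptions chosen only for this illustration; they are not part of the paper's procedure.

```python
def classifier_value(w, w0, xi):
    """Left-hand side of (2.4): inner product w . xi plus the constant w0."""
    return sum(wi * x for wi, x in zip(w, xi)) + w0


def satisfies_margin(w, w0, patterns, outputs, d=1):
    """Return True if every pattern meets (2.4) with margin d.

    patterns: list of N-dimensional tuples
    outputs:  list of desired outputs f (1 for category A, 0 for category B)
    """
    for xi, f in zip(patterns, outputs):
        value = classifier_value(w, w0, xi)
        if f == 1 and value < d:
            return False
        if f == 0 and value > -d:
            return False
    return True


# Hypothetical two-input pattern set, for illustration only.
patterns = [(1, 1), (1, 0), (0, 1)]
outputs = [1, 1, 0]

ok = satisfies_margin((2, 0), -1, patterns, outputs)    # all three rows meet the margin
bad = satisfies_margin((1, 0), 0, patterns, outputs)    # (0, 1) gives 0, which is not <= -1
print(ok, bad)
```

The check only verifies feasibility; it says nothing about which feasible weight vector is preferable, which is the role of the objective function introduced next.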

In order to facilitate our computation based on linear programming, let
each variable $w_i$ be decomposed into non-negative variables as follows:

$$w_i = w_i^{+} - w_i^{-} \qquad (i = 0, 1, 2, \ldots, N),$$

where

$$w_i^{+} \ge 0, \quad w_i^{-} \ge 0 \qquad (i = 0, 1, 2, \ldots, N). \tag{2.5}$$

Thus (2.4) is rewritten as

$$\begin{aligned}
\sum_{i=0}^{N} (w_i^{+} - w_i^{-})\, \xi_i^{(j)} &\ge d && \text{if } f = 1,\\
\sum_{i=0}^{N} (w_i^{+} - w_i^{-})\, \xi_i^{(j)} &\le -d && \text{if } f = 0,
\end{aligned} \qquad j = 1, \ldots, M, \tag{2.6}$$

where $\xi_0^{(j)} = 1$.

If (2.6) (i.e. (2.4)) is consistent, there are generally an infinite
number of solutions $(w, w_0)$. Henceforth we will consider only a solution
which minimizes a linear objective function of the weight vector

$$u \cdot w^{\pm}, \tag{2.7}$$

where $u$ is a $(2N + 2)$-dimensional vector and

$$w^{\pm} = (w_0^{+}, w_0^{-}, \ldots, w_N^{+}, w_N^{-}). \tag{2.8}$$

The coordinates of $u$ will be determined in the following paragraphs,
depending upon which parameters of the classifier we want to minimize. The
minimization of the objective function (2.7) under the constraints (2.5)
and (2.6) is a typical linear programming problem.

The objective function is expressed in a general form in (2.7). However,
it can be related to the reliable operation of the classifier as follows.
Assume that the deviation of each input $\xi_i$ due to noise or other
fluctuations in the circuit of the classifier is $\delta_i$. Then the
actual value of the left side of (2.4) is

$$\sum_{i=1}^{N} w_i (\xi_i^{(j)} + \delta_i) + w_0 = \sum_{i=1}^{N} w_i \xi_i^{(j)} + \sum_{i=1}^{N} w_i \delta_i + w_0 .$$

Now let $\delta$ be the maximum of $|\delta_i|$ over all $i$:

$$|\delta_i| \le \delta, \qquad i = 1, 2, \ldots, N.$$

Then we have

$$\Big| \sum_{i=1}^{N} w_i \delta_i \Big| \le \delta \sum_{i=1}^{N} |w_i| .$$

Therefore if $\delta$ satisfies

$$\begin{aligned}
\sum_{i=1}^{N} w_i \xi_i^{(j)} - \delta \sum_{i=1}^{N} |w_i| + w_0 &> 0 && \text{if } f = 1,\\
\sum_{i=1}^{N} w_i \xi_i^{(j)} + \delta \sum_{i=1}^{N} |w_i| + w_0 &< 0 && \text{if } f = 0,
\end{aligned} \qquad j = 1, \ldots, M, \tag{2.9}$$

the linear classifier operates correctly* even when all $\delta_i$ have the
maximum deviation $+\delta$ or $-\delta$ (i.e. $|\delta_i| = \delta$ for all
$i$). The maximum value of $\delta$ such that the classifier operates
correctly is called the input tolerance. Denoting the input tolerance by
$\gamma$, we have

$$\gamma = \min_{j} \frac{\Big| \sum_{i=1}^{N} w_i \xi_i^{(j)} + w_0 \Big|}{\sum_{i=1}^{N} |w_i|} \tag{2.10}$$

by (2.9). If we determine a solution $(w, w_0)$ such that $\gamma$ is
maximized, the classifier is allowed to have the maximum deviation in its
inputs and therefore we have maximized the reliability of the operation of
the classifier. We will prove that the maximization of (2.10) is
equivalent to the minimization of $\sum_{i=1}^{N} |w_i|$.

* See the footnote to (2.4).

First of all, if $w$ is a weight vector which maximizes $\gamma$, then
$k \cdot w$ also provides the same input tolerance, where $k$ is any
positive number such that $k \cdot w$ is still a solution of (2.4) (the
scaling is understood to apply to $w_0$ as well). Now we can assume

$$\min_{j} \Big| \sum_{i=1}^{N} w_i \xi_i^{(j)} + w_0 \Big| = d \tag{2.11}$$

without loss of generality. If
$\min_{j} \big| \sum_{i=1}^{N} w_i \xi_i^{(j)} + w_0 \big| = t > d$, we can
simply multiply by the positive number $k = d/t < 1$ in order to obtain
(2.11), which of course still satisfies (2.4).

Consequently the maximization of $\gamma$ amounts to minimizing

$$\sum_{i=1}^{N} |w_i|$$

under conditions (2.4) and (2.11). In this case, however, we can delete
condition (2.11). This is because if we minimize $\sum_{i=1}^{N} |w_i|$
under condition (2.4) and obtain

$$\min_{j} \Big| \sum_{i=1}^{N} w_i \xi_i^{(j)} + w_0 \Big| = e > d,$$

then $d/e < 1$ and

$$w' = \frac{d}{e} \cdot w$$

satisfies (2.4). For this new weight vector $w'$,

$$\sum_{i=1}^{N} |w_i'| < \sum_{i=1}^{N} |w_i|$$

holds. This is a contradiction. Therefore we must have $e = d$, i.e.
(2.11) is satisfied. An alternative proof is found in the literature.
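The role of scaling in the argument above can be checked numerically. The following Python sketch evaluates the input tolerance (2.10) and shows that multiplying $(w, w_0)$ by a positive constant leaves it unchanged, while $\sum |w_i|$ shrinks or grows with the constant. The function name and the pattern data are assumptions made for this illustration.

```python
from fractions import Fraction


def input_tolerance(w, w0, patterns):
    """Input tolerance (2.10): min_j |w . xi(j) + w0| divided by sum_i |w_i|."""
    total = sum(abs(wi) for wi in w)
    if total == 0:
        raise ValueError("all weights are zero, (2.10) is undefined")
    smallest = min(
        abs(sum(wi * x for wi, x in zip(w, xi)) + w0) for xi in patterns
    )
    return Fraction(smallest) / Fraction(total)


# Hypothetical two-input pattern set, for illustration only.
patterns = [(1, 1), (1, 0), (0, 1)]

gamma = input_tolerance((2, 0), -1, patterns)          # |.| values are 1, 1, 1; sum |w_i| = 2
gamma_scaled = input_tolerance((4, 0), -2, patterns)   # same classifier scaled by k = 2
print(gamma, gamma_scaled)
```

Because the ratio is scale-invariant, fixing the numerator at $d$ as in (2.11) turns the maximization of $\gamma$ into the minimization of the denominator, which is the step the proof relies on.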

By setting

$$u_0^{+} = u_0^{-} = 0 \quad \text{and} \quad u_i^{+} = u_i^{-} = 1 \quad \text{for } i = 1, 2, \ldots, N,$$

the objective function (2.7) becomes

$$\sum_{i=1}^{N} (w_i^{+} + w_i^{-}) \tag{2.12}$$

which represents $\sum_{i=1}^{N} |w_i|$*. Solution of the linear program
composed of (2.12), (2.5) and (2.6) will lead to the design of a linear
classifier with the maximum reliability of operation.

* Obviously the minimization of (2.12) leads to the condition that either
$w_i^{+}$ or $w_i^{-}$ is always 0. Then
$\sum_{i=1}^{N} (w_i^{+} + w_i^{-}) = \sum_{i=1}^{N} |w_i|$ follows.

Another case, in which the values of $\xi_i$ are limited to $+1$ and $-1$
instead of real numbers, was discussed in earlier papers. If $w_i \xi_i$
$(i = 0, 1, 2, \ldots, N)$ is permitted to deviate as much as
$w_i \xi_i (1 + \delta)$ or $w_i \xi_i (1 - \delta)$, where $\delta$ is now a
percentage deviation, then the input tolerance of a majority element is

$$\frac{1}{W} \min_{j} \Big| \sum_{i=0}^{N} w_i \xi_i^{(j)} \Big| , \tag{2.13}$$

where

$$W = \sum_{i=0}^{N} |w_i| .$$

It was proved that the minimization of $W$ is equivalent to the
maximization of the input tolerance, i.e. the maximization of the
reliability of operation. For this case we set $u_i^{+} = u_i^{-} = 1$,
$i = 0, 1, 2, \ldots, N$, in (2.7).

Although the input tolerance is a typical objective to be optimized, if a
certain parameter of the classifier can be represented in the form of
(2.7), we can optimize other characteristics of the classifier rather than
the input tolerance. In particular, when integer programming is used, a
wider variety of objective functions may be available for our choice.

So far we have assumed that the linear classifier processes the set of
distinct patterns $\{\xi^{(1)}, \ldots, \xi^{(M)}\}$. In other words, only
$\xi^{(1)}, \ldots, \xi^{(M)}$ are supplied repeatedly to the linear
classifier as a time series
$\ldots, \xi(t-1), \xi(t), \xi(t+1), \xi(t+2), \ldots$ The structure of the
linear classifier was optimized for the set
$\{\xi^{(1)}, \ldots, \xi^{(M)}\}$.

Let us consider the case where the time series of pattern vectors
gradually comes to contain new pattern vectors, for which the structure of
the linear classifier is not optimum, while some of the existing pattern
vectors are eliminated from the time series. Several schemes are
conceivable for deciding whether an incoming pattern vector should be
treated as a new pattern vector for which the structure of the classifier
is to be optimized. These schemes will be discussed in Section 6. Here,
for the moment, let us assume that the incoming pattern vector is found to
be new by a certain scheme at some instant and that the linear classifier
is supposed to process the new set of distinct pattern vectors
$\{\xi^{(2)}, \ldots, \xi^{(M+1)}\}$ instead of
$\{\xi^{(1)}, \ldots, \xi^{(M)}\}$. In other words, $\xi^{(1)}$ is replaced
by $\xi^{(M+1)}$. Now the structure of the classifier is supposed to be
optimized for $\{\xi^{(2)}, \ldots, \xi^{(M+1)}\}$. The linear classifier
should adapt itself to this new environment.

By changing the notation appropriately and multiplying the second
inequality of (2.6) by $-1$, our linear program to minimize (2.12) under the
constraints (2.5) and (2.6) is converted into

$$\begin{aligned}
\text{minimize} \quad & c\,x\\
\text{subject to} \quad & A x \ge b^{T},\\
& x \ge 0,
\end{aligned} \tag{2.14}$$

where $x$ is an $n$-dimensional vector of unknown variables to be
determined and

$$c = (c_1, \ldots, c_n), \qquad
A = \begin{pmatrix} a_{11} & \cdots & a_{1n}\\ \vdots & & \vdots\\ a_{m1} & \cdots & a_{mn} \end{pmatrix}, \qquad
b = (b_1, \ldots, b_m). \tag{2.15}$$

Note that $A$ corresponds to the original given set of pattern vectors,
$b$ to the $d$'s, and $x$ to the original $w$. $c$ represents the
coefficients of the objective function. Thus the adaptability problem of
the classifier under a change of environment may be stated as follows:
given an optimum solution for the linear program (2.14), find an optimum
solution for the new linear program with the new coefficient row
$a^{(m+1)}$ replacing the oldest row $a^{(1)}$, representing the change
from $\xi^{(1)}$ to $\xi^{(M+1)}$, i.e.

$$\begin{aligned}
\text{minimize} \quad & c\,x\\
\text{subject to} \quad & A' x \ge b'^{\,T},\\
& x \ge 0,
\end{aligned} \tag{2.16}$$

where

$$A' = \begin{pmatrix} a_{21} & \cdots & a_{2n}\\ \vdots & & \vdots\\ a_{m+1,1} & \cdots & a_{m+1,n} \end{pmatrix}, \qquad
b' = (b_2, \ldots, b_{m+1}). \tag{2.17}$$

The effectiveness of the methods which we will discuss depends on the
validity of the assumption that an optimum solution of (2.16) does not
differ "too much" from that of (2.14).
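The conversion of the pattern set into the data of (2.14) can be sketched in a few lines of Python. The function name `build_lp` and the variable order $x = (w_1^{+}, w_1^{-}, \ldots, w_N^{+}, w_N^{-}, w_0^{+}, w_0^{-})$ are assumptions made for this sketch; (2.8) lists the $w_0$ pair first, and any fixed order works as long as $c$, $A$ and $x$ agree. The cost vector corresponds to the choice of $u$ in (2.12).

```python
def build_lp(patterns, outputs, d=1):
    """Return (c, A, b) for 'minimize c x subject to A x >= b, x >= 0'.

    Assumed variable order: (w1+, w1-, ..., wN+, wN-, w0+, w0-).
    Rows belonging to f = 0 are multiplied by -1, as described for (2.14).
    """
    n_inputs = len(patterns[0])
    c = [1] * (2 * n_inputs) + [0, 0]      # cost 1 on each weight part, 0 on the w0 pair
    A, b = [], []
    for xi, f in zip(patterns, outputs):
        sign = 1 if f == 1 else -1
        row = []
        for value in list(xi) + [1]:       # the appended 1 is the coefficient of w0
            row += [sign * value, -sign * value]
        A.append(row)
        b.append(d)
    return c, A, b


# Hypothetical two-input pattern set, for illustration only.
c, A, b = build_lp([(1, 1), (1, 0), (0, 1)], [1, 1, 0])
for row in A:
    print(row)
```

Replacing one pattern vector by a new one then changes exactly one row of `A` (and the matching entry of `b`), which is the situation described by (2.16) and (2.17).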
3.   Outline  of  the  Simplex  Method

We assume that readers are familiar with the basic properties of linear
inequalities and the simplex method, which is a computational procedure to
solve a linear programming problem. However, let us sketch the important
concepts which will be used often in the rest of this paper.

3.1  Duality  Theorem  of  a  Linear  Program 

Consider the following linear program:

$$\begin{aligned}
\text{maximize} \quad & b\,v\\
\text{subject to} \quad & A^{T} v \le c^{T},\\
& v \ge 0,
\end{aligned} \tag{3.1.1}$$

where $v$ is an $m$-dimensional vector of unknown variables. This linear
program is called the dual of (2.14). The coefficient matrix is the
transpose of $A$ in (2.14), and $b$ and $c$ are interchanged. It is known
that when we solve either (3.1.1) or (2.14) by the simplex method, we will
find either optimum solutions to both or the infeasibility of the problems
(one of them is possibly unbounded). Therefore we can solve the more
convenient of (3.1.1) or (2.14).
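The duality relation can be illustrated numerically. The sketch below checks feasibility of a primal point $x$ for (2.14) and of a dual point $v$ for (3.1.1), and compares the two objective values; equal values certify that both points are optimal. The helper names are assumptions, and the data are a small three-constraint instance of the kind treated in the example of Section 4.

```python
def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def primal_feasible(A, b, x):
    """x >= 0 and A x >= b, as in (2.14)."""
    return all(xj >= 0 for xj in x) and all(
        dot(row, x) >= bi for row, bi in zip(A, b)
    )


def dual_feasible(A, c, v):
    """v >= 0 and A^T v <= c, as in (3.1.1); zip(*A) iterates over columns of A."""
    return all(vi >= 0 for vi in v) and all(
        dot(col, v) <= cj for col, cj in zip(zip(*A), c)
    )


# Illustrative data: three inequalities in six split variables.
A = [[1, -1, 1, -1, 1, -1],
     [1, -1, 0, 0, 1, -1],
     [0, 0, -1, 1, -1, 1]]
b = [1, 1, 1]
c = [1, 1, 1, 1, 0, 0]

x = [2, 0, 0, 0, 0, 1]   # candidate primal solution
v = [1, 0, 1]            # candidate dual solution

print(primal_feasible(A, b, x), dual_feasible(A, c, v), dot(c, x), dot(b, v))
```

For any feasible pair the dual objective never exceeds the primal one, so agreement of the two values leaves no room for improvement on either side.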

3.2  Simplex  Method. 

Let us sketch the simplex method. See the literature [8] and [9] for
details.

In this paper we will work on the dual problem (3.1.1) rather than the
primal problem (2.14), because (3.1.1) in our case is more advantageous
than (2.14) in several respects, as will be seen later.

We reformulate (3.1.1) in the following form by introducing so-called
slack variables $s_1, s_2, \ldots, s_n$:

$$\begin{aligned}
\text{maximize} \quad & b\,v\\
\text{subject to} \quad & \begin{pmatrix} A^{T} & I \end{pmatrix} \begin{pmatrix} v\\ s \end{pmatrix} = c^{T},\\
& v \ge 0, \quad s \ge 0,
\end{aligned} \tag{3.2.1}$$

where $I$ is the $n \times n$ unit matrix, $v = (v_1, \ldots, v_m)^{T}$ and
$s = (s_1, \ldots, s_n)^{T}$.

The simplex method is a systematic procedure to choose a sequence of sets
of $n$ basic variables (only basic variables can assume non-zero values;
non-basic variables are assigned zero) out of the $m + n$ variables
$v_1, \ldots, v_m, s_1, \ldots, s_n$ until we obtain an optimal solution
consisting of basic variables which maximizes $b\,v$. A solution which
satisfies the constraints of (3.2.1) but is not necessarily optimal is
called a feasible solution. When $c_1, c_2, \ldots, c_n \ge 0$, we may choose

$$v_1 = v_2 = \cdots = v_m = 0, \qquad s_i = c_i \quad (i = 1, 2, \ldots, n) \tag{3.2.2}$$

as an initial feasible solution. Let us assume that $u$ of (2.7) is
non-negative, as seen in (2.12) for example. Then $c_1, \ldots, c_n$ are all
non-negative, as required in (3.2.2).

The simplex method is most conveniently described by using a tableau
representation. The first simplex tableau is shown in Fig. 3.2.1, whose
elements are given by the column vectors $p$ and $q$, the row vector $r$,
and the matrix $H$.

[Fig. 3.2.1  Simplex Tableau: columns p and q and matrix H with n rows and n + m columns, above the row r]

Here $q = (q_1, \ldots, q_n)^{T}$ gives the values of the basic variables in
the current tableau (therefore $q \ge 0$ means that the solution $q$ is
feasible) and $p = (p_1, \ldots, p_n)^{T}$ denotes the coefficients of the
basic variables $q$ in the objective function (i.e. the $b_j$'s of $b$ in
(3.2.1) corresponding to the basic variables in $q$). $H$ may be divided
into two parts,

$$H = [h_1, h_2, \ldots, h_{m+n}] = [D, B^{-1}], \tag{3.2.3}$$

where $h_i$ for $1 \le i \le m$ is the column associated with the variable
$v_i$ and $h_i$ for $m + 1 \le i \le n + m$ is the column associated with
$s_{i-m}$. Each entry of $r = (r_0, r_1, \ldots, r_{m+n})$ can be obtained
by the following:

$$r_0 = p \cdot q, \qquad r_i = p \cdot h_i - b_i \quad (i = 1, 2, \ldots, m + n), \tag{3.2.4}$$

where $b_{m+1}, \ldots, b_{m+n}$ are set to 0 (see (3.2.1)). If we start
with the initial feasible solution (3.2.2), the initial tableau consists of

$$p = 0, \quad q = c^{T}, \quad D = A^{T}, \quad B^{-1} = I \ \text{(the unit matrix)}, \quad r_0 = 0, \quad r_i = -b_i \quad (i = 1, 2, \ldots, m + n). \tag{3.2.5}$$

At each simplex tableau, there are three possibilities:

(1)  $r \ge 0$.

(2)  Existence of $i$ such that $r_i < 0$ and $h_i \le 0$.

(3)  Neither (1) nor (2).

Case (1) means that the feasible solution at the current tableau is
optimal, and case (2) means that the linear program has an unbounded
solution (i.e. the primal problem (2.14) is infeasible). Whenever case (3)
is encountered, the so-called pivot operation is applied to the current
tableau and the entries are transformed according to the simplex rule,
deriving the next tableau. Then we examine which of the above three
possibilities holds in the new tableau. With repeated derivation of
tableaux we will eventually arrive at case (1) or (2). The pivot operation
is essentially a process of replacing one basic variable (i.e. a column of
the current tableau) by a new basic variable. See the literature [8] and
[9] for the details of the derivation of the simplex tableau.
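The three cases and the pivot operation can be put together in a compact tableau routine. The following Python sketch works on the dual (3.1.1), starts from the slack basis (3.2.2), and recomputes $r$ by (3.2.4) at every step. The function name, the use of exact fractions, and the choice of pivot (first column with negative $r_i$, smallest ratio $q_j / h_{ji}$, ties broken by smallest variable index) are assumptions of this sketch; the text above does not prescribe a particular pivot rule.

```python
from fractions import Fraction


def simplex_on_dual(A, b, c):
    """Maximize b v subject to A^T v <= c, v >= 0 by tableau pivoting.

    Assumes every c_j >= 0 so that the slack basis (3.2.2) is feasible.
    Returns (r0, x) where x is read from the slack entries of the final r row,
    or None when case (2), an unbounded dual, is met.
    """
    m, n = len(A), len(c)
    # Rows of H = [A^T | I], one row per basic variable (initially s_1..s_n).
    H = [[Fraction(A[i][j]) for i in range(m)] +
         [Fraction(int(j == k)) for k in range(n)] for j in range(n)]
    q = [Fraction(cj) for cj in c]
    cost = [Fraction(bi) for bi in b] + [Fraction(0)] * n   # b_i, zero for slacks
    basis = list(range(m, m + n))                           # column index of each basic variable
    while True:
        p = [cost[k] for k in basis]
        r = [sum(p[j] * H[j][i] for j in range(n)) - cost[i] for i in range(m + n)]
        entering = next((i for i, ri in enumerate(r) if ri < 0), None)
        if entering is None:                                # case (1): optimal
            return sum(pj * qj for pj, qj in zip(p, q)), r[m:]
        rows = [j for j in range(n) if H[j][entering] > 0]
        if not rows:                                        # case (2): unbounded
            return None
        # case (3): pivot on the row with the smallest ratio
        leave = min(rows, key=lambda j: (q[j] / H[j][entering], basis[j]))
        pivot = H[leave][entering]
        H[leave] = [h / pivot for h in H[leave]]
        q[leave] = q[leave] / pivot
        for j in range(n):
            if j != leave and H[j][entering] != 0:
                factor = H[j][entering]
                H[j] = [hj - factor * hl for hj, hl in zip(H[j], H[leave])]
                q[j] -= factor * q[leave]
        basis[leave] = entering


# Illustrative data: three inequalities in six split variables.
A = [[1, -1, 1, -1, 1, -1],
     [1, -1, 0, 0, 1, -1],
     [0, 0, -1, 1, -1, 1]]
result = simplex_on_dual(A, [1, 1, 1], [1, 1, 1, 1, 0, 0])
print(result)
```

On this instance the routine needs three pivots; the returned `x` is the primal solution read from the slack columns of the last $r$ row, as stated in the paragraph that follows.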

If the simplex method ends up with case (1), an optimal solution of
(3.1.1) will be in column $q$ of the last tableau and an optimal solution
of (2.14) will be $(r_{m+1}, \ldots, r_{m+n})$, whose coordinates are the
values of $x_1, \ldots, x_n$, respectively.

Note that the following relations hold for every simplex tableau:

$$q = B^{-1} c^{T}, \qquad h_i = B^{-1} a_i^{T} \quad (i = 1, 2, \ldots, m + n), \tag{3.2.6}$$

where $a_i^{T}$ is the $i$th column of $[A^{T}\ I]$ of (3.2.1). These
relations will be used later.




4.   Adaptation  by  the  Simplex  Method

When a classifier characterized by the linear program (2.14) adapts itself
to a new environment with an old pattern vector replaced by a new one, it
is characterized by the new linear program (2.16). Let us assume (for a
while) that the linear programs (2.14) and (2.16) are both feasible. The
case where they are infeasible will be considered at the end of this
section and also in the next section.

Now suppose that we have solved the linear program (2.14) by using the
dual formulation (3.1.1) and have obtained an optimum solution in the last
simplex tableau as shown in Fig. 4.1. The adaptation is essentially an
efficient method for deriving an initial tableau of the new problem (2.16)
(strictly speaking, of the dual formulation of (2.16)) and applying the
pivot operations until a new optimal solution is obtained.

[Fig. 4.1  Last Simplex Tableau]

This type of problem has already been studied and is included in the
subject of parametric programming. In this paper, however, we propose a
new procedure rather than using the standard technique of parametric
programming, because (1) the pattern classification problem is suitable
for the dual formulation with respect to the signs of the coefficients (no
need of artificial variables), (2) the subsequent procedure for adaptation
is simpler and more straightforward (the dual simplex method is not
needed) and (3) our formulation is easily extendable to the integer linear
programming formulation which will be discussed in Section 5.

This initial tableau is expected to yield faster convergence than the
ordinary initial tableau given in (3.2.5).

In order to obtain this new initial tableau, we have to (1) eliminate the
effect of the inequality

$$a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \ge b_1 \tag{4.1}$$

which corresponds to the pattern vector to be eliminated, and (2)
introduce a new inequality

$$a_{m+1,1} x_1 + a_{m+1,2} x_2 + \cdots + a_{m+1,n} x_n \ge b_{m+1} \tag{4.2}$$

which corresponds to a new pattern vector, without increasing the size of
the simplex tableau.

The first step, corresponding to (1), is to set $b_1$ equal to $-S$, where
$S$ is a sufficiently large positive number such that (4.1) is satisfied
for all feasible solutions of the remaining inequalities. This means that
(4.1) is now non-restrictive. Let us recall that the current linear
program is treated by the dual formulation and therefore the inequality
(4.1) is associated with the first column $h_1(F)$ of $H(F)$ in the last
simplex tableau. The change of $b_1$ causes a change of the entry of $p(F)$
which corresponds to the variable $v_1$ (or $h_1(F)$) if $v_1$ is in the
basis. But it causes no change if $h_1(F)$ is not in the basis. The entries
of $r(F)$ are accordingly recalculated by (3.2.4). After this, simply
delete the first column $h_1(F)$ and $r_1$. The deletion is permissible
because $h_1(F)$ henceforth will not enter the basis (since $r_1$ will not
be negative).

The second step, corresponding to (2) above, is to introduce the new
inequality (4.2) into the column where (4.1) was eliminated. The new
entries can be obtained by (3.2.4) and (3.2.6), i.e.

$$h_1 = B^{-1} a_{m+1}^{T}, \tag{4.3}$$

where $a_{m+1} = (a_{m+1,1}, a_{m+1,2}, \ldots, a_{m+1,n})$, and

$$r_1 = p \cdot h_1 - b_{m+1}. \tag{4.4}$$

Note that $q$ was not changed in the above two steps and therefore the new
tableau still has a feasible solution (although it may not be optimal).
Thus this new tableau can be used as an initial tableau for the dual of
the problem (2.16).

If all the entries of $r$ are non-negative, the old solution is optimal
for the new problem also. But if $r$ contains some negative entries, apply
the pivot operations until an optimal solution is derived for the new
problem.

Although we discussed the procedure only for the case in which one
inequality is replaced, the extension to the case of more than one
inequality is obvious.

The procedure would need less computation time if the coordinates of the
new solution do not deviate "too much" from those of the old solution,
since the new pivot operations start from a solution "closer" to the new
problem than the ordinary initial solution of (3.2.5).
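The second step of the procedure, (4.3) and (4.4), amounts to one matrix-vector product and one inner product. The Python sketch below computes the entries of the replaced column. The function name is an assumption, and the numerical data are an assumption as well: `B_inv` is the inverse of the basis matrix for the basis $(v_3, s_2, s_3, s_4, v_1, s_5)$ of the first pattern set in the example that follows, worked out by hand for this sketch.

```python
def new_column(B_inv, p, a_new, b_new):
    """Entries of the replaced column.

    h = B^{-1} a^T as in (4.3), r = p . h - b as in (4.4).
    """
    h = [sum(bij * aj for bij, aj in zip(row, a_new)) for row in B_inv]
    r = sum(pj * hj for pj, hj in zip(p, h)) - b_new
    return h, r


# Assumed data: rows of B^{-1} follow the basis order (v3, s2, s3, s4, v1, s5).
B_inv = [
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, -1],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
p = [1, 0, 0, 0, 1, 0]          # objective coefficients of the basic variables
a_new = [0, 0, 0, 0, 1, -1]     # new inequality x5 - x6 >= 1
b_new = 1

h, r = new_column(B_inv, p, a_new, b_new)
print(h, r)
```

A negative `r` signals that the old basis is no longer optimal once the new inequality is present, so pivoting has to resume from this tableau; a non-negative `r` would mean that the old solution remains optimal.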

Example.

Let us pursue the adaptation of the linear classifier shown in Fig. 4.2,
when the pattern vectors change as follows from (4.5) through (4.7):

[Fig. 4.2  Linear Classifier]

(i)    $(x_1, x_2)$:  (11), (10) yield $f = 1$;  (01) yields $f = 0$          (4.5)

(ii)   $(x_1, x_2)$:  (11), (00) yield $f = 1$;  (01) yields $f = 0$          (4.6)

(iii)  $(x_1, x_2)$:  (11) yields $f = 1$;  (01), (00) yield $f = 0$.         (4.7)

Corresponding  to  these  sets  of  pattern  vectors  we  have  the  following 
sets  of  inequalities  by  (2.4): 


$$\text{(i)} \quad \begin{aligned}
w_1 + w_2 + w_0 &\ge 1\\
w_1 + w_0 &\ge 1\\
w_2 + w_0 &\le -1
\end{aligned} \tag{4.8}$$

$$\text{(ii)} \quad \begin{aligned}
w_1 + w_2 + w_0 &\ge 1\\
w_0 &\ge 1\\
w_2 + w_0 &\le -1
\end{aligned} \tag{4.9}$$

$$\text{(iii)} \quad \begin{aligned}
w_1 + w_2 + w_0 &\ge 1\\
w_0 &\le -1\\
w_2 + w_0 &\le -1
\end{aligned} \tag{4.10}$$

Here $d$ in (2.4) is set to 1 for simplicity.

Let us split these variables $w_i$ as seen in (2.5) in order to keep the
non-negativeness of the variables in the simplex method. The objective
function to be minimized is now

$$|w_1| + |w_2| = w_1^{+} + w_1^{-} + w_2^{+} + w_2^{-}. \tag{4.11}$$

Renaming these split variables and changing the direction of some
inequalities, the above sets of inequalities may be rewritten as follows,
corresponding to (2.14):

$$\text{(i)} \quad \begin{aligned}
x_1 - x_2 + x_3 - x_4 + x_5 - x_6 &\ge 1 && v_1\\
x_1 - x_2 + x_5 - x_6 &\ge 1 && v_2\\
-x_3 + x_4 - x_5 + x_6 &\ge 1 && v_3
\end{aligned} \tag{4.12}$$

$$\text{(ii)} \quad \begin{aligned}
x_1 - x_2 + x_3 - x_4 + x_5 - x_6 &\ge 1 && v_1\\
x_5 - x_6 &\ge 1 && v_4\\
-x_3 + x_4 - x_5 + x_6 &\ge 1 && v_3
\end{aligned} \tag{4.13}$$

$$\text{(iii)} \quad \begin{aligned}
x_1 - x_2 + x_3 - x_4 + x_5 - x_6 &\ge 1 && v_1\\
-x_5 + x_6 &\ge 1 && v_5\\
-x_3 + x_4 - x_5 + x_6 &\ge 1 && v_3
\end{aligned} \tag{4.14}$$

where $v_i$ is shown simply for identification of each inequality. The
objective function in the renamed variables is

$$x_1 + x_2 + x_3 + x_4. \tag{4.15}$$

Note that the following relations hold among the original and renamed
variables:

$$\begin{aligned}
x_1 - x_2 &= w_1^{+} - w_1^{-} = w_1\\
x_3 - x_4 &= w_2^{+} - w_2^{-} = w_2\\
x_5 - x_6 &= w_0^{+} - w_0^{-} = w_0
\end{aligned} \tag{4.16}$$

Assume that our classifier has the first set of pattern vectors (i). An
initial tableau for the dual of this linear program will be derived by
(3.2.5). Table 4.1 shows the initial tableau derived. After applying the
pivot operation three times (Tables 4.2, 4.3 and 4.4), an optimal solution
will result, since Table 4.4 contains no negative entry in $r$. The
solution is obtained in $(r_4, \ldots, r_9)$, i.e.,

$$x_1 = 2, \qquad x_6 = 1, \qquad x_2 = x_3 = x_4 = x_5 = 0, \tag{4.17}$$

which implies

$$w_1 = 2, \qquad w_2 = 0, \qquad w_0 = -1. \tag{4.18}$$
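As a cross-check that does not rely on the tableaux, the optimality of (4.18) can also be seen directly from the inequalities of the first pattern set. The short derivation below is an added illustration, not part of the original argument; it subtracts the third inequality of (4.8) from the first.

```latex
\begin{aligned}
(w_1 + w_2 + w_0) - (w_2 + w_0) &\ge 1 - (-1)
  &&\Longrightarrow\quad w_1 \ge 2,\\
|w_1| + |w_2| &\ge |w_1| \ge 2,
  &&\text{with equality for } w_1 = 2,\ w_2 = 0,\\
w_1 + w_0 \ge 1 \ \text{ and } \ w_2 + w_0 \le -1
  &&&\Longrightarrow\quad -1 \le w_0 \le -1 .
\end{aligned}
```

Hence any feasible weight vector has $|w_1| + |w_2| \ge 2$, and the bound is attained only by $w_1 = 2$, $w_2 = 0$, $w_0 = -1$, in agreement with (4.18).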


Then assume that the first set of pattern vectors is changed to the second
set (ii). According to the procedure discussed, eliminate the inequality
which corresponds to the old pattern vector to be replaced,

$$x_1 - x_2 + x_5 - x_6 \ge 1 \qquad v_2 \tag{4.19}$$

and instead introduce the inequality

$$x_5 - x_6 \ge 1 \qquad v_4. \tag{4.20}$$

Variables
in the basis   p    q |  v1   v2   v3 |  s1   s2   s3   s4   s5   s6
s1             0    1 |   1    1    0 |   1    0    0    0    0    0
s2             0    1 |  -1   -1    0 |   0    1    0    0    0    0
s3             0    1 |   1    0   -1 |   0    0    1    0    0    0
s4             0    1 |  -1    0    1 |   0    0    0    1    0    0
s5             0    0 |   1    1   -1 |   0    0    0    0    1    0
s6             0    0 |  -1   -1    1 |   0    0    0    0    0    1
r                   0 |  -1   -1   -1 |   0    0    0    0    0    0

Table 4.1  Initial Tableau for the First
Set of Pattern Vectors


Variables
in the basis   p    q |  v1   v2   v3 |  s1   s2   s3   s4   s5   s6
s1             0    1 |   0    0  (1) |   1    0    0    0   -1    0
s2             0    1 |   0    0   -1 |   0    1    0    0    1    0
s3             0    1 |   0   -1    0 |   0    0    1    0   -1    0
s4             0    1 |   0    1    0 |   0    0    0    1    1    0
v1             1    0 |   1    1   -1 |   0    0    0    0    1    0
s6             0    0 |   0    0    0 |   0    0    0    0    1    1
r                   0 |   0    0   -2 |   0    0    0    0    1    0

Table 4.2  Intermediate Tableau for the First
Set of Pattern Vectors (pivot element in parentheses)


Variables
in the basis   p    q |  v1   v2   v3 |  s1   s2   s3   s4   s5   s6
v3             1    1 |   0    0    1 |   1    0    0    0   -1    0
s2             0    2 |   0    0    0 |   1    1    0    0    0    0
s3             0    1 |   0   -1    0 |   0    0    1    0   -1    0
s4             0    1 |   0    1    0 |   0    0    0    1    1    0
v1             1    1 |   1    1    0 |   1    0    0    0    0    0
s6             0    0 |   0    0    0 |   0    0    0    0  (1)    1
r                   2 |   0    0    0 |   2    0    0    0   -1    0

Table 4.3  Intermediate Tableau for the First
Set of Pattern Vectors (pivot element in parentheses)


Variables
in the basis   p    q |  v1   v2   v3 |  s1   s2   s3   s4   s5   s6
v3             1    1 |   0    0    1 |   1    0    0    0    0    1
s2             0    2 |   0    0    0 |   1    1    0    0    0    0
s3             0    1 |   0   -1    0 |   0    0    1    0    0    1
s4             0    1 |   0    1    0 |   0    0    0    1    0   -1
v1             1    1 |   1    1    0 |   1    0    0    0    0    0
s5             0    0 |   0    0    0 |   0    0    0    0    1    1
r                   2 |   0    0    0 |   2    0    0    0    0    1

Table 4.4  Optimal Tableau for the First Set of
Pattern Vectors


b ->                  |   1 -100    1 |   0    0    0    0    0    0

Variables
in the basis   p    q |  v1   v2   v3 |  s1   s2   s3   s4   s5   s6
v3             1    1 |   0    0    1 |   1    0    0    0    0    1
s2             0    2 |   0    0    0 |   1    1    0    0    0    0
s3             0    1 |   0   -1    0 |   0    0    1    0    0    1
s4             0    1 |   0    1    0 |   0    0    0    1    0   -1
v1             1    1 |   1    1    0 |   1    0    0    0    0    0
s5             0    0 |   0    0    0 |   0    0    0    0    1    1
r                   2 |   0  101    0 |   2    0    0    0    0    1

Table 4.5  Elimination of the Old Vector v2


oo 


Replace 1 in b (of $v_2$) by a sufficiently small number, say -100.
Since $v_2$ is not in the basis in Table 4.4, the replacement does not
change any entry of Table 4.4 except $r_2$. Table 4.5 shows the
resultant tableau. Next delete $h_2$ and $r_2$ from Table 4.5 and fill in
new entries which are derived from the new inequality (4.20) according
to (4.3) and (4.4). Note that $B^{-1}$ is $[h_4, h_5, \ldots, h_9]$ in this case.
The result is shown in Table 4.6. Since there is a negative entry in
r, apply the pivot operation. After two applications, a new optimal
solution is derived as shown in Table 4.7. It is

    $x_1 = 2,\ x_4 = 2,\ x_5 = 1,\ x_2 = x_3 = x_6 = 0$    (4.21)

which leads to

    $w_1 = 2,\ w_2 = -2,\ w_0 = 1.$    (4.22)

         p           v1   v4   v3   s1   s2   s3   s4   s5   s6
v3       1     1      0   -1    1    1    0    0    0    0    1
s2       0     2      0    0    0    1    1    0    0    0    0
s3       0     1      0   -1    0    0    0    1    0    0    1
s4       0     1      0  (1)    0    0    0    0    1    0   -1
v1       1     1      1    0    0    1    0    0    0    0    0
s5       0     0      0    0    0    0    0    0    0    1    1
r              2      0   -2    0    2    0    0    0    0    1

Table 4.6
Initial Tableau for the Second
Set of Pattern Vectors


         p           v1   v4   v3   s1   s2   s3   s4   s5   s6
v3       1     2      0    0    1    1    0    0    1    0    0
s2       0     2      0    0    0    1    1    0    0    0    0
s3       0     2      0    0    0    0    0    1    1    0    0
v4       1     1      0    1    0    0    0    0    1    1    0
v1       1     1      1    0    0    1    0    0    0    0    0
s6       0     0      0    0    0    0    0    0    0    1    1
r              4      0    0    0    2    0    0    2    1    0

Table 4.7
Optimal Tableau for the Second
Set of Pattern Vectors
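
The two pivot operations that lead from Table 4.6 to Table 4.7 can be repeated with exact arithmetic. The following is a minimal sketch, not the authors' program: the tableau entries are copied as read from Table 4.6, while the helper names (`pivot`, `run_pivots`) and the choice of entering column (most negative entry of r, followed by the usual minimum-ratio test) are assumptions made only for this illustration.

```python
from fractions import Fraction

# Column 0 is the unlabeled value column; then v1, v4, v3, s1, ..., s6.
NAMES = ["v1", "v4", "v3", "s1", "s2", "s3", "s4", "s5", "s6"]
basis = ["v3", "s2", "s3", "s4", "v1", "s5"]
rows = [
    [1, 0, -1, 1, 1, 0, 0, 0, 0, 1],
    [2, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, -1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, -1],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]
r = [2, 0, -2, 0, 2, 0, 0, 0, 0, 1]
rows = [[Fraction(x) for x in row] for row in rows]
r = [Fraction(x) for x in r]


def pivot(rows, r, i, j):
    """Gauss-Jordan pivot on element (i, j), updating the row r as well."""
    piv = rows[i][j]
    rows[i] = [x / piv for x in rows[i]]
    for k in range(len(rows)):
        if k != i and rows[k][j] != 0:
            f = rows[k][j]
            rows[k] = [a - f * b for a, b in zip(rows[k], rows[i])]
    f = r[j]
    r[:] = [a - f * b for a, b in zip(r, rows[i])]


def run_pivots(rows, r, basis):
    """Pivot until r has no negative entry; return the number of pivots."""
    count = 0
    while True:
        j = min(range(1, len(r)), key=lambda c: r[c])
        if r[j] >= 0:
            return count
        ratios = [(rows[i][0] / rows[i][j], i)
                  for i in range(len(rows)) if rows[i][j] > 0]
        if not ratios:
            raise ValueError("no positive entry in the entering column")
        _, i = min(ratios)
        pivot(rows, r, i, j)
        basis[i] = NAMES[j - 1]
        count += 1


n_pivots = run_pivots(rows, r, basis)
print(n_pivots, basis)
print([int(x) for x in r])
```

Under these assumptions the sketch needs two pivots, and the entries of r under the columns s1, ..., s6 are the values of $x_1, \ldots, x_6$ read off the final tableau.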




The change from the second set of pattern vectors (ii) to
the third set (iii) can be treated in a similar manner. In this case
the p column is changed because $v_4$ is in the basis, as seen in Table 4.8.
Then Table 4.9 is the initial tableau for the third set (iii). An
optimal solution is obtained in Table 4.10 after two pivot operations.
It is

    $w_1 = 2,\ w_2 = 0,\ w_0 = -1.$    (4.23)

From (4.18), (4.22), (4.23), we can see that the structure of the
linear classifier has changed as shown in Fig. 4.3.

Fig. 4.3 Adaptation of Linear Classifier




         p           v1   v4   v3   s1   s2   s3   s4   s5   s6
v3       1     2      0    0    1    1    0    0    1    0    0
s2       0     2      0    0    0    1    1    0    0    0    0
s3       0     2      0    0    0    0    0    1    1    0    0
v4    -100     1      0    1    0    0    0    0    1    1    0
v1       1     1      1    0    0    1    0    0    0    0    0
s6       0     0      0    0    0    0    0    0    0    1    1
r            -97      0    0    0    2    0    0  -99 -100    0

Table 4.8
Elimination of the Old Vector v4
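
The effect of eliminating a vector that is in the basis can be checked numerically. The sketch below assumes that each entry of the row r is $r_j = p \cdot h_j - b_j$ and that the leading entry of r is the inner product of p with the unlabeled value column. This reading is inferred from the tableaux themselves (for instance, it gives $r_2 = 101$ in Table 4.5) rather than quoted from (4.3) and (4.4), and the function name `reduced_costs` is hypothetical.

```python
# Body of Table 4.7 (columns v1, v4, v3, s1, ..., s6), rows in basis order
# v3, s2, s3, v4, v1, s6.
H = [
    [0, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]
values = [2, 2, 2, 1, 1, 0]          # unlabeled value column
b = [1, 1, 1, 0, 0, 0, 0, 0, 0]      # 1 for pattern vectors, 0 for slacks
p = [1, 0, 0, 1, 1, 0]               # b-values of the basic columns


def reduced_costs(p, H, b, values):
    """Return (leading entry of r, remaining entries of r)."""
    objective = sum(pi * vi for pi, vi in zip(p, values))
    r = [sum(p[i] * H[i][j] for i in range(len(p))) - b[j]
         for j in range(len(b))]
    return objective, r


obj_old, r_old = reduced_costs(p, H, b, values)

# Eliminate v4: its b-entry becomes -S. Because v4 is basic (fourth row),
# the matching entry of p changes too, so several entries of r change.
S = 100
b_new = list(b)
b_new[1] = -S
p_new = list(p)
p_new[3] = -S
obj_new, r_new = reduced_costs(p_new, H, b_new, values)

print(obj_old, r_old)
print(obj_new, r_new)
```

With these inputs the first call gives the row r of Table 4.7 and the second gives the row r of Table 4.8, which is why the p column must be kept in storage for the adaptation.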

         p           v1   v5   v3   s1   s2   s3   s4   s5   s6
v3       1     2      0    0    1    1    0    0    1    0    0
s2       0     2      0    0    0    1    1    0    0    0    0
s3       0     2      0    0    0    0    0    1    1    0    0
v4    -100     1      0   -1    0    0    0    0    1    1    0
v1       1     1      1    0    0    1    0    0    0    0    0
s6       0     0      0    0    0    0    0    0    0    1    1
r            -97      0   99    0    2    0    0  -99 -100    0

Table 4.9
Initial Tableau for the Third
Set of Pattern Vectors


         p           v1   v5   v3   s1   s2   s3   s4   s5   s6
v3       1     1      0    1    1    1    0    0    0    0    1
s2       0     2      0    0    0    1    1    0    0    0    0
s3       0     1      0    1    0    0    0    1    0    0    1
s4       0     1      0   -1    0    0    0  ...

Table 4.10
Optimal Tableau for the Third
Set of Pattern Vectors

Now let us consider the case in which the given set of pattern
vectors is not separable, i.e., (2.14) (or (2.16)) is not feasible.
By adding an artificial variable to each inequality of (2.14), (2.14)
can be rewritten as

    $a_{i1} x_1 + \cdots + a_{in} x_n + t_i \geq b_i$
    $t_i \geq 0 \quad (i = 1, 2, \ldots, m)$    (4.24)
    $x_j \geq 0 \quad (j = 1, 2, \ldots, n).$

Obviously (4.24) is feasible because the set of inequalities is
satisfied by a sufficiently large positive value of each $t_i$. However,
the positiveness of $t_i$ does not mean that the corresponding $i$-th
inequality of (2.14) is satisfied. In order to circumvent this fact,
let us use the following new objective function:

    $V \sum_{i=1}^{m} t_i + c \cdot x$    (4.25)

where $V$ is some large positive number. Roughly speaking, the
minimization of this objective function means that it first minimizes
$\sum_{i=1}^{m} t_i$, which will tend to minimize the number of incorrectly
separated vectors, and that it next minimizes $c \cdot x$, the original
objective function.* The set of inequalities (4.24) with the objective
function (4.25) can be solved in a similar manner as the case of (2.14),
since (4.24) is now feasible. Although the number of artificial
variables increases indefinitely if the adaptation is repeated
infinitely, the storage space increase can be avoided by replacing the
old artificial variable to be eliminated by the new artificial variable.

The objective function (4.25), however, does not always lead to
the minimum number of erroneously separated pattern vectors. As will
be discussed in the next section, the exact minimization of the number
of incorrectly separated vectors will be obtained by using integer
linear programming.

*  Ideally, $\sum_{i=1}^{m} t_i$ is to be minimized, no matter what value
$c \cdot x$ assumes. However, this is not achieved by (4.25) because the
$t_i$'s are continuous variables. In other words, the value of
$\sum_{i=1}^{m} t_i$, when (4.25) is minimized, depends upon the relative
size of $V$ and $c \cdot x$ and may not be minimized. Note that if the $t_i$'s
are discrete variables, $\sum_{i=1}^{m} t_i$ is minimized, as will be
discussed in Section 5.
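
The gap between minimizing the sum of the artificial variables and minimizing the number of incorrectly separated vectors can be seen on a small example. The sketch below is only an illustration: the one-dimensional data, the margin d = 1 and the coarse grid of candidate weights are all hypothetical, the term $c \cdot x$ is ignored, and a brute-force search stands in for the simplex method.

```python
# Hypothetical one-dimensional data: (pattern value, desired output f).
DATA = [(1, 1), (2, 1), (3, 1), (-1, 0), (-2, 0), (-3, 0), (20, 0)]
D = 1  # assumed margin d of the separating inequalities


def slacks(w, w0):
    """Smallest t_i >= 0 satisfying each relaxed inequality of the (4.24) type."""
    out = []
    for x, f in DATA:
        s = w * x + w0 if f == 1 else -(w * x + w0)
        out.append(max(0, D - s))
    return out


def total_slack(candidate):
    return sum(slacks(*candidate))


def n_errors(candidate):
    """Number of inequalities that need a positive artificial variable."""
    return sum(1 for t in slacks(*candidate) if t > 0)


def grid(lo, hi, step):
    n = int(round((hi - lo) / step))
    return [lo + k * step for k in range(n + 1)]


candidates = [(w, w0) for w in grid(-2, 2, 0.25) for w0 in grid(-3, 3, 0.25)]
best_sum = min(candidates, key=total_slack)
best_count = min(candidates, key=n_errors)

print("min sum of t:", best_sum, total_slack(best_sum), n_errors(best_sum))
print("min count   :", best_count, total_slack(best_count), n_errors(best_count))
```

On this data a single distant vector dominates the sum, so the weights with the smallest sum of $t_i$ violate more inequalities than the weights with the smallest count, which is the behavior that motivates the integer formulation of Section 5.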

5.   Formulation and Adaptation by Integer Linear Programming

Given a non-separable set of pattern vectors, one realization of
a linear classifier is to minimize the number of incorrectly separated
vectors. This can be achieved by the following integer linear
programming approach.

Corresponding to each input vector $\xi^{(j)}$ of (2.4),

    $w \cdot \xi^{(j)} + w_0 \geq d$  if $f(\xi^{(j)}) = 1$    (5.1A)
or
    $-w \cdot \xi^{(j)} - w_0 \geq d$  if $f(\xi^{(j)}) = 0$,    (5.1B)

formulate the two inequalities for each of (5.1A) and (5.1B) as follows:

    $\xi^{(j)} \cdot w + w_0 + U P_j \geq d$    (5.2A)
    $\xi^{(j)} \cdot w + w_0 - U(1 - P_j) \leq -d$  if $f(\xi^{(j)}) = 1$    (5.3A)
or
    $-\xi^{(j)} \cdot w - w_0 + U P_j \geq d$    (5.2B)
    $-\xi^{(j)} \cdot w - w_0 - U(1 - P_j) \leq -d$  if $f(\xi^{(j)}) = 0$    (5.3B)

where $P_j$ is a variable which assumes 1 or 0, and where $U$ is a
sufficiently large positive number to insure the following property.
When $P_j = 0$, (5.2A and B) is obviously identical to (5.1A and B)
respectively and (5.3A and B) is satisfied by any value of $\xi^{(j)}$. This
means (5.1A or B) is satisfied by $w$ and accordingly the pattern vector is
classified correctly. On the other hand, when $P_j = 1$, (5.3A and B) become

    $w \cdot \xi^{(j)} + w_0 \leq -d$  if $f = 1$    (5.4A)
or
    $-w \cdot \xi^{(j)} - w_0 \leq -d$  if $f = 0$.    (5.4B)

The pattern vector is separated incorrectly. Therefore

    $\sum_{j=1}^{m} P_j$    (5.5)

shows the number of incorrectly separated pattern vectors.
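
The role of the 0-1 variable $P_j$ in (5.2) and (5.3) can be made concrete with a short check. This is a sketch under stated assumptions: the values d = 1 and U = 1000, the sample pattern vectors and the function names are hypothetical, and the weights (2, -2) with threshold 1 are used only as a convenient example.

```python
def constraints_hold(w, w0, xi, f, P, d=1, U=1000):
    """Check (5.2) and (5.3) for one pattern vector and a given P in {0, 1}."""
    s = sum(wi * x for wi, x in zip(w, xi)) + w0
    if f == 0:
        s = -s  # the B versions use -w.xi - w0
    return (s + U * P >= d) and (s - U * (1 - P) <= -d)


def feasible_P(w, w0, xi, f, d=1, U=1000):
    """Values of P for which both inequalities can be satisfied."""
    return [P for P in (0, 1) if constraints_hold(w, w0, xi, f, P, d, U)]


w, w0 = (2, -2), 1
samples = [((1, 0), 1), ((0, 1), 1), ((0, 1), 0), ((1, 1), 1)]
choices = [feasible_P(w, w0, xi, f) for xi, f in samples]
n_incorrect = sum(P[0] for P in choices)   # the sum (5.5) for these samples
inside_margin = feasible_P((1, 0), 0, (0, 0), 1)

print(choices, n_incorrect, inside_margin)
```

In this sketch each sample admits exactly one value of $P_j$, so the sum (5.5) counts the incorrectly separated vectors. A vector whose weighted sum lies strictly between $-d$ and $d$ admits neither value, which means such weights are simply infeasible for the integer program.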
Now consider the objective function:

    minimize $V \sum_{j=1}^{m} P_j + u \cdot w$,    (5.6)*

where $u$ is the vector in (2.7) and where $V$ is a sufficiently large
positive number (i.e., $V > \max u \cdot w$). Different from the objective
function (4.25) in Section 4, the minimization of this objective function
means the minimization of both $\sum P_j$ and $u \cdot w$, because $\sum P_j$
assumes discrete values only. This is the property which we desired.

Of course, we can consider other objective functions if they are
needed. The incorporation of integer variables extensively widens the
concept of objective functions. For example, it is possible to minimize
the number of non-zero weights, i.e., the number of inputs actually needed,
by reformulating the integer linear programming with additional integer
variables and changing the objective function.**

This new linear program is a mixed-integer linear programming
problem because $P_j$ must be integral. There are several known methods
to solve this type of problem. Among them, we will use Gomory's
all-integer integer linear programming method [10] by assuming that the
other variables are also integers, without loss of generality in the
sense that both formulations will yield the same minimum number of
incorrectly separated pattern vectors. This integral condition is
adopted because his method is very similar to the simplex method which
was applied to the dual problem discussed in the earlier sections.
The adaptation process will be accordingly analogous to the case of
these earlier sections.

*      A weighted sum $\sum e_j P_j$ may be used, when there is preference
among pattern vectors.

**     This formulation is due to F. Chen.

In order to apply Gomory's method we need additional constraints,

    $P_j \leq 1, \quad j = 1, 2, \ldots, m.$    (5.7)

For notational convenience, let us henceforth denote our new problem
of (5.2), (5.3), (5.7) together with the objective function (5.6)
by:

    minimize $c^* \cdot x^*$
    subject to
        $A^* x^* \geq b^{*T}$    (5.8)
        $x^* \geq 0$ and integers,

where $x^*$ is an $n'$-dimensional vector of unknown variables to be
determined and

    $c^* = (c^*_1, \ldots, c^*_{n'})$
    $A^* = \begin{pmatrix} a^*_{11} & \cdots & a^*_{1n'} \\ \vdots & & \vdots \\ a^*_{m'1} & \cdots & a^*_{m'n'} \end{pmatrix}$    (5.9)
    $b^* = (b^*_1, \ldots, b^*_{m'})$.

In our case, $m'$ is equal to $3M$ and $n'$ is the number of variables
$x_1, \ldots, x_n, P_1, \ldots, P_m$, which is $2(N + 1) + 3M$.

Before discussing our adaptation process, let us outline
Gomory's method. Rigorous description and proofs are found in [10].

The method is very similar to the ordinary simplex method applied
to the dual formulation. A simplex tableau of the method is shown in
Fig. 5.1. Each column of H corresponds to an inequality of (5.8).
(Slack variables are taken into account. See (3.2.1).) However, when
some entries of the row r are negative, we pick a certain column
according to a rule stated in [10], and form a new column called a
"cut" from that column. The cut is the column
$(h_{m'+n'+1}, r_{m'+n'+1})$ in Fig. 5.1. The procedure to derive the
cut is also discussed in [10]. Then the ordinary pivot operation is
performed using this cut. This process is repeated until we obtain
all non-negative entries in r or find at least one column, say i,
such that $r_i < 0$ and $h_i \leq 0$ (infeasibility of (5.8)). Thus, only
cuts can enter the basis. The initial tableau could be the one as
shown in (3.2.5).

Fig. 5.1 Tableau for Integer Linear Programming

Although Gomory did not include the column p in his method, p is
necessary for our adaptation procedure. In the simplex method of
Section 3, the column p had the following meaning: suppose $p_i$
corresponds to column $h_j$ which was transformed from the inequality

    $a_{j1} x_1 + \cdots + a_{jn} x_n \geq b_j$,

and then $p_i = b_j$ holds. The column p in Fig. 5.1 of integer linear
programming also has a similar meaning, except that, since each basis
is derived through a cut rather than through each inequality as in the
simplex method, $p_i$ corresponds to the right-hand side of the cut
through which $p_i$ was introduced.

p can be calculated by (3.2.4) each time a cut is introduced
into a basis. Assume that a cut is to be introduced into the i-th
row by the pivot operation; then the equation from (3.2.4)

    $r_{m'+n'+1} = p \cdot h_{m'+n'+1} - b^*_{m'+n'+1}$

is rewritten as

    $b^*_{m'+n'+1} = p \cdot h_{m'+n'+1} - r_{m'+n'+1}$    (5.10)

where only $b^*_{m'+n'+1}$ is unknown, because the entire
$(m'+n'+1)$-th column is obtained by Gomory's algorithm [10]. Now
the new $p_i$ which will be used in the next step is

    $p_i = b^*_{m'+n'+1}$.    (5.11)

The other entries of p do not change.

Applying this pivot operation until all the entries in r
become non-negative, an optimal integer solution for (5.8) is found in
$(r_{m'+1}, \ldots, r_{m'+n'})$ as in the case of the ordinary simplex method.

Suppose that we have obtained an optimal solution for the
problem (5.8). Now let us eliminate the effect of the oldest pattern
vector and introduce a new pattern vector. Let us express the oldest
pattern vector by

    $a_{11} x_1 + \cdots + a_{1n} x_n + U P_1 \geq b_1$    (5.12)
    $a_{11} x_1 + \cdots + a_{1n} x_n - U(1 - P_1) \leq b_1 - 1$    (5.13)

or, by renaming the variable coefficients and the constant and
by changing the direction of the second inequality,

    $a^*_{11} x^*_1 + \cdots + a^*_{1n'} x^*_{n'} \geq b^*_1$    (5.14)
    $a^*_{21} x^*_1 + \cdots + a^*_{2n'} x^*_{n'} \geq b^*_2$.    (5.15)

Similarly let us denote the new pattern vector by

    $a^*_{m'+1,1} x^*_1 + \cdots + a^*_{m'+1,n'} x^*_{n'} \geq b^*_{m'+1}$    (5.16)
    $a^*_{m'+2,1} x^*_1 + \cdots + a^*_{m'+2,n'} x^*_{n'} \geq b^*_{m'+2}$,    (5.17)

which correspond to (5.2) and (5.3) for the given pattern vector.

The elimination of the oldest pattern vector can be done in a
similar way as before: replace $b^*_1$ and $b^*_2$ by $-S$, where $S$ is a
sufficiently large positive number. However, this affects the tableau
only through the cuts which are derived from the inequalities (5.14)
and (5.15). (The classifier must have memory space for this
information.) In other words, this means replacement of the entries of p
which correspond to the cuts which were derived from (5.14) and (5.15) by
$-S$ so that the cuts become non-restrictive. More than one entry or
no entry may exist, depending on what the current basis is.

The next step is to delete the columns $h_1$, $h_2$ and the entries
$r_1$, $r_2$ which correspond to (5.14) and (5.15), and then to calculate
the new columns $h_1$, $h_2$ from the new inequalities (5.16) and (5.17) by
using the relation (3.2.6). Of course, $B^{-1} = [h_{m'+1}, \ldots, h_{m'+n'}]$.
Note that the variable $P_{m'+1}$, which is implicitly in (5.16) and (5.17),
should be replaced by $P_1$ of the oldest pattern vector in order to
prevent the increase of the number of variables. Finally the new row r
can be obtained by (3.2.4). (Here we need p, which was obtained by (5.11).)

If there are negative entries in the new row r, Gomory's pivot
operation is repeated as discussed before until we obtain an optimal
solution. This completes our adaptation procedure. (Note that the
condition $r_i < 0$ and $h_i \leq 0$ will not be reached since our problem
is always feasible.) The whole process will be repeated when
pattern vectors are changed.

6.   Entire Scheme of Adaptation and New Pattern Vector Identification

The adaptation procedure of a linear classifier which we have
discussed in the previous sections is summarized as follows: Given an
optimal structure for the set of distinct pattern vectors
$\{\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(M)}\}$, the classifier readjusts itself so
that its structure is optimal for the new set of distinct pattern vectors
$\{\xi^{(2)}, \ldots, \xi^{(M)}, \xi^{(M+1)}\}$, where $\xi^{(1)}$ is replaced by
$\xi^{(M+1)}$. However, we did not discuss how to identify $\xi^{(M+1)}$ among
the pattern vectors incoming as a time series, in order to get the new set
$\{\xi^{(2)}, \ldots, \xi^{(M+1)}\}$ for which the classifier's structure
ought to be optimal. There are a few different identification schemes
conceivable. These schemes will be discussed in this section.

The entire system of the linear classifier is illustrated in Fig. 6.1.
Block C stores the last simplex tableau for $\{\xi^{(1)}, \ldots, \xi^{(M)}\}$.
Block A examines an incoming vector $\xi$ and decides whether it should be
considered as a new pattern vector $\xi^{(M+1)}$ by checking the information
about $\{\xi^{(1)}, \ldots, \xi^{(M)}\}$ stored in Block C (the adjustment of
the structure of the classifier results), or not (no adjustment results).
If the new pattern vector $\xi^{(M+1)}$ is identified, Block B computes the
new optimal structure for $\{\xi^{(2)}, \ldots, \xi^{(M+1)}\}$ by starting from
the last simplex tableau for $\{\xi^{(1)}, \ldots, \xi^{(M)}\}$ stored in
Block C. Using this new structure, Block D classifies the incoming
vector $\xi$. (The vector $\xi$ must be kept in a buffer memory during
the identification process by Block A and the computation of the new
structure by Block B.) The last simplex tableau is stored in Block C.
Note that the last simplex tableau now stores the information about
$\{\xi^{(2)}, \ldots, \xi^{(M+1)}\}$ instead of $\{\xi^{(1)}, \ldots, \xi^{(M)}\}$.
This completes a cycle of the adaptation procedure.

Fig. 6.1 Adaptation Scheme


The entire procedure is characterized by specifying how Block A
identifies the new pattern vector $\xi^{(M+1)}$ which is then processed by
Block B. If Block A very often identifies an incoming vector $\xi$ as a new
pattern vector $\xi^{(M+1)}$, then the whole adaptation procedure also must
work as often, and it slows down the processing speed of the classifier.
In this case, however, the accuracy of classification by the classifier
is maintained. On the other hand, if very few incoming pattern vectors
are processed for adaptation, the processing speed will be much faster.
(Since the interval between adjacent adaptations is long, the buffer
memory will not be filled up by incoming pattern vectors which are
waiting for classification. Therefore the transmission rate of pattern
vectors can be increased.)

When the memory space of Block C is limited, the size of M is limited.
Therefore if the number of distinct vectors in the time series is more than
M, some selection of M vectors out of all distinct vectors must be made
for adaptation. Therefore, generally speaking, even if the time series
contains new pattern vectors, the adaptation may not take place and the
classifier may not work correctly for each pattern vector.

In the following, three simple schemes to identify new pattern
vectors are arranged in the descending order of adaptation frequency.

Adaptation scheme (1): Regard every incoming pattern vector as the
new vector $\xi^{(M+1)}$, no matter whether it is actually new or not. This
eliminates the checking procedure of Block A. This scheme guarantees that
the current structure is optimal for the last M pattern vectors. But the
classifier does adaptation all the time and the optimality of the structure
is valid only over the last M pattern vectors.

Adaptation scheme (2): Block A checks whether each incoming pattern
vector is new or not, by comparing it with those stored in Block C. If it
is new, the classifier solves a linear program; otherwise it does not.
This scheme is different from (1) in that the structure in (2) is optimal
for the M distinct pattern vectors which appeared most recently. These M
distinct pattern vectors are, of course, stored in the simplex tableau.

Adaptation scheme (3): This scheme is very similar to the so-called
error-correction procedure of an adaptive element, as far as the
selection of a pattern vector is concerned. $\xi$ is regarded as the new
pattern vector $\xi^{(M+1)}$ only if $\xi$ is classified erroneously by the
current structure. The checking of whether it is classified correctly
or not is done by simply substituting the current solution into the
corresponding inequality. Note that if the pattern $\xi$ is separated
correctly, the current optimum structure for $\{\xi^{(1)}, \ldots, \xi^{(M)}\}$
is also optimum for $\xi$ in the sense that it is optimum for
$\{\xi^{(1)}, \ldots, \xi^{(M)}, \xi\}$. However, obviously the optimum structure
for $\{\xi^{(2)}, \ldots, \xi^{(M)}, \xi\}$, which is obtained by replacing
$\xi^{(1)}$ by $\xi$, may be different from the current structure.

As a result the adaptation will take place less frequently than in
the other schemes. In other words, the majority of incoming vectors are
not included in the M vectors, but it is likely that they are correctly
classified when they appear again. The current structure, however, is
optimum only for the last M pattern vectors for which the classifier
solved linear programs, because the readjusted structure may no longer be
optimum for the pattern vectors which were not identified as new
pattern vectors and therefore have nothing to do with the current structure.

In addition to the above three schemes, depending upon the given
situation, various sophisticated approaches might also be conceivable,
including the addition of a random sampling approach. However, various
aspects of incoming pattern vectors should be examined, and possibly
experimental justification is necessary, in order to find a proper
scheme.
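
The difference in adaptation frequency among the three schemes can be sketched as follows. This is an illustration only: `refit` and `classify` are hypothetical stand-ins for Blocks B and D (a lookup table instead of the simplex update), the stream of pattern vectors is made up, and unknown vectors are assumed to fall into category 0.

```python
from collections import deque


def refit(window):
    """Stand-in for Block B: a lookup table, NOT the simplex adaptation."""
    return dict(window)


def classify(structure, xi):
    """Stand-in for Block D; vectors not in the table default to category 0."""
    return structure.get(xi, 0)


def count_adaptations(scheme, stream, M):
    window = deque(maxlen=M)      # Block C: the M stored pattern vectors
    structure = refit(window)
    adaptations = 0
    for xi, f in stream:
        if scheme == 1:           # every incoming vector is treated as new
            new = True
        elif scheme == 2:         # new only if not among the stored vectors
            new = (xi, f) not in window
        elif scheme == 3:         # new only if classified erroneously
            new = classify(structure, xi) != f
        else:
            raise ValueError("scheme must be 1, 2 or 3")
        if new:
            window.append((xi, f))    # the oldest vector drops out when full
            structure = refit(window)
            adaptations += 1
    return adaptations


stream = [((0, 0), 0), ((0, 1), 1), ((0, 1), 1), ((1, 0), 0),
          ((0, 1), 1), ((1, 1), 1), ((0, 0), 0)]
counts = {s: count_adaptations(s, stream, M=3) for s in (1, 2, 3)}
print(counts)
```

On this made-up stream the number of adaptations decreases from scheme (1) to scheme (3), in line with the ordering stated above. The actual counts depend entirely on the stream and on the stand-in classifier.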

7.   Computational Experiment of Adaptation by the Simplex Method

The adaptation algorithm by the simplex method discussed in Section 4
was tested by actually solving a sequence of sets of pattern vectors.
Each pattern set consisted of 30 30-dimensional pattern vectors whose
coordinates are 1 or 0. Each two consecutive pattern sets differ in
exactly one pattern vector so that the adaptation algorithm may be applied.
First, we generated 60 pattern vectors, together with the categories to
which the pattern vectors should belong, by using a random number generator
which produces 1 and 0 with equal probability. Second, the i-th pattern
set S(i) was defined as follows:

    $S(i) = \{\xi^{(i)}, \ldots, \xi^{(i+29)}\}, \quad i = 1, 2, \ldots, 31$,

where $\xi^{(j)}$, j = 1, 2, ..., 60, is the j-th generated pattern vector.
These 31 problems corresponding to S(1), ..., S(31) were transformed into
sets of linear inequalities as illustrated in the example in Section 4,
and then solved on the IBM 360/75 computer by using the IBM Mathematical
Programming System. (Although MPS cannot perform exactly the same procedure
as described in Section 4, an equivalent modified test was tried in order
to observe the number of necessary pivot operations.)

The result is that when we solve the 31 problems separately, the
average number of pivot operations is 51.3 for each problem, whereas 18.5
pivot operations are needed in our case of the adaptation scheme. The
saving of pivot operations was about two thirds. This may be encouraging
because the tested problem seems difficult for our approach, since pattern
vectors which are coming in or going out are generated independently of
the other pattern vectors in the set (i.e., by the random number generator)
and accordingly the new solution may not have much relationship with the
old one.
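
The construction of the 31 pattern sets and the size of the reported saving can be reproduced with a few lines. This sketch only mirrors the set-up described above: the seed and the use of Python's `random` module are assumptions, and no linear program is solved here.

```python
import random

random.seed(0)  # arbitrary seed, chosen only to make the sketch repeatable
N_VECTORS, DIM, SET_SIZE = 60, 30, 30

# 60 random 0/1 pattern vectors with random 0/1 categories.
vectors = [tuple(random.randint(0, 1) for _ in range(DIM))
           for _ in range(N_VECTORS)]
categories = [random.randint(0, 1) for _ in range(N_VECTORS)]

# S(i) holds the indices of its 30 vectors (0-based here, 1-based in the text).
pattern_sets = [list(range(i, i + SET_SIZE))
                for i in range(N_VECTORS - SET_SIZE + 1)]

# Consecutive sets differ in exactly one outgoing and one incoming vector.
for prev, cur in zip(pattern_sets, pattern_sets[1:]):
    assert len(set(prev) - set(cur)) == 1
    assert len(set(cur) - set(prev)) == 1

# Relative saving implied by the reported averages (51.3 versus 18.5 pivots).
saving = 1 - 18.5 / 51.3
print(len(pattern_sets), round(saving, 3))
```

The ratio of the two reported averages corresponds to a saving of roughly 64 percent, consistent with the statement that about two thirds of the pivot operations were saved.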

As is expected, the number of pivot operations fluctuates more
widely in our adaptation scheme than in the case of solving the 31 problems
separately. This is because the adaptation procedure usually needs more
pivot operations if the pattern vector which leaves the current pattern
set is in the basis of the last simplex tableau than if it is not in
the basis.

8.   Conclusion

We have discussed the adaptation procedure of a linear classifier
based on a linear programming method, making use of the easy incorporation
of a new constraint in the simplex method for the dual formulation. Its
advantage over the existing ordinary adaptive procedures (such as the
error-correction method) is that the optimality of a parameter of
the classifier, such as the input tolerance, which is a measure of the
reliability of the classifier operation, is guaranteed.

Even when a given set of pattern vectors is not linearly separable,
the adaptation procedure based on integer linear programming attains the
minimality of the number of incorrectly separated pattern vectors as well
as the input tolerance of the classifier.

On the other hand, the disadvantage of the approach is the need
for somewhat complicated computation for each adaptation, and the need for
storage space for the simplex tableau.

The computational experiment in Section 7 is encouraging and may
indicate the feasibility of this kind of system.




Acknowledgment 
The authors would like to thank C. R. Baugh for his
assistance in performing the experiment of Section 7 and for valuable
comments on the manuscript.




References 

[1]  N. J. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.

[2]  F. Rosenblatt, Principles of Neurodynamics, Spartan Books,
Washington, D.C., 1961.

[3]  O. L. Mangasarian, "Linear and Nonlinear Separation of Patterns
by Linear Programming", Operations Research, vol. 13, no. 3,
pp. 444-452, May-June 1965.

[4]  F. W. Smith, "Pattern Classifier Design by Linear Programming",
IEEE Trans. on Computers, vol. C-17, no. 4, pp. 367-372,
April 1968.

[5]  S. Muroga, Threshold Logic, Lecture notes for EE 497 and EE 498,
Department of Computer Science, University of Illinois, 1965-1966.

[6]  S. Muroga, "Majority Logic and Problems of Probabilistic Behavior",
in Self-Organizing Systems, Spartan Books, 1962, pp. 243-281.

[7]  A. J. Goldman and A. W. Tucker, "Theory of Linear Programming",
in Linear Inequalities and Related Systems, edited by H. W. Kuhn
and A. W. Tucker, Princeton University Press, 1956, pp. 53-97.

[8]  G. B. Dantzig, Linear Programming and Extensions, Princeton
University Press, Princeton, New Jersey, 1963.

[9]  G. Hadley, Linear Programming, Addison-Wesley Series
in Industrial Management, Addison-Wesley, 1962.

[10] R. E. Gomory, "An All-Integer Integer Programming Algorithm", in
Industrial Scheduling, edited by Muth and Thompson, Prentice-Hall
International Series in Management, Prentice-Hall, 1963,
pp. 193-206.


